103 research outputs found

    Linear Time Construction of Cover Suffix Tree and Applications

    The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST, an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $\alpha$-partial covers and seeds in $T$ for a given $\alpha$ in $O(n)$ time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $\alpha$-partial cover for each $\alpha=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+\mathit{output})$ time, where $\mathit{output}$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space.
    Comment: Accepted to ESA 2023. Abstract abridged to satisfy arXiv requirement
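The two numbers annotating a CST node can be illustrated with a naive sketch. This is not the linear-time construction from the paper; it simply scans the text for one substring $C$. The function names are hypothetical, and the convention used for "non-overlapping consecutive occurrences" (counting pairs of consecutive occurrences whose starting positions differ by at least $|C|$) is an assumption that may differ in detail from the paper's definition.

```python
def occurrences(text, pattern):
    """Return all 0-based starting positions of pattern in text (overlaps included)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    starts = []
    i = text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts


def covered_positions(text, pattern):
    """Count positions of text that lie inside at least one occurrence of pattern."""
    covered = 0
    reach = 0  # first position not yet counted as covered
    for start in occurrences(text, pattern):
        end = start + len(pattern)
        covered += end - max(start, reach)
        reach = end
    return covered


def nonoverlapping_consecutive(text, pattern):
    """Count pairs of consecutive occurrences that do not overlap.

    Assumed convention: a pair (i, j) of consecutive occurrences counts
    when j - i >= len(pattern).
    """
    starts = occurrences(text, pattern)
    return sum(1 for i, j in zip(starts, starts[1:]) if j - i >= len(pattern))
```

For example, in `"aabaabaa"` the substring `"aa"` occurs at positions 0, 3 and 6, covering 6 positions, whereas `"aabaa"` occurs at positions 0 and 3 with an overlap and covers all 8 positions.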

    Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

    We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically-provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. As our basic approach, a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem is used. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms conditioned on the optimality of the original algorithm for the knapsack problem.
    Comment: 22 page
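As a rough illustration of the simple version, where only the text is uncertain, the sketch below matches an ordinary pattern against a weighted sequence by brute force. It assumes the common convention that a pattern occurs at a position when the product of the per-position letter probabilities is at least $1/z$ for a threshold parameter $z$; the abstract does not spell out this definition. The function name and the list-of-dicts representation are hypothetical, and this is not the lookahead scoring algorithm of the paper.

```python
def weighted_occurrences(weighted_text, pattern, z):
    """Return positions where pattern matches weighted_text with probability >= 1/z.

    weighted_text is a list of dicts mapping each letter to its probability
    at that position (letters missing from a dict have probability 0).
    """
    threshold = 1.0 / z
    hits = []
    for i in range(len(weighted_text) - len(pattern) + 1):
        prob = 1.0
        for j, letter in enumerate(pattern):
            prob *= weighted_text[i + j].get(letter, 0.0)
            if prob < threshold:
                break  # the product can only shrink, so stop early
        else:
            hits.append(i)
    return hits


# A weighted sequence of length 3 over the alphabet {a, b}.
text = [{"a": 0.5, "b": 0.5}, {"a": 1.0}, {"a": 0.25, "b": 0.75}]
```

With `z = 2`, the pattern `"aa"` matches only at position 0 (probability 0.5) and `"ab"` only at position 1 (probability 0.75).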

    Internal Pattern Matching Queries in a Text and Applications

    We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of the efficient solutions for subword suffix rank & selection, as well as substring compression using Burrows-Wheeler transform composed with run-length encoding.
    Comment: 31 pages, 9 figures; accepted to SODA 201
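The constant-space representation mentioned above can be illustrated with a naive sketch. When $|y| \le 2|x|$, the occurrences of $x$ in $y$ are known to form a single arithmetic progression, so they can be returned as a Python `range`. The function name and the half-open `(start, end)` fragment encoding are assumptions made for this example, and the scan below takes time linear in $|y|$ rather than the constant query time of the data structure described in the abstract.

```python
def internal_pattern_matching(text, x, y):
    """Return the occurrences of fragment x inside fragment y as a range.

    x and y are half-open (start, end) index pairs into text, with
    |y| <= 2|x|. The result lists starting positions in text.
    """
    (xs, xe), (ys, ye) = x, y
    pattern, window = text[xs:xe], text[ys:ye]
    if not pattern or len(window) > 2 * len(pattern):
        raise ValueError("expects a non-empty x and |y| <= 2|x|")

    # Naive scan for all (possibly overlapping) occurrences.
    starts = []
    i = window.find(pattern)
    while i != -1:
        starts.append(ys + i)
        i = window.find(pattern, i + 1)

    if not starts:
        return range(0)
    if len(starts) == 1:
        return range(starts[0], starts[0] + 1)
    # Under the length constraint, consecutive gaps are all equal.
    step = starts[1] - starts[0]
    return range(starts[0], starts[-1] + 1, step)
```

For instance, `list(internal_pattern_matching("abababab", (0, 4), (0, 8)))` gives `[0, 2, 4]`: three occurrences of `"abab"` stored as a first position, a step and an end bound.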